Vol. 30 ECCB 2014, pages i609-i616 
doi: 1 0. 1 093/bioinformatics/btu4 72 



The impact of incomplete knowledge on the evaluation of protein 
function prediction: a structured-output learning perspective 

Yuxiang Jiang\ Wyatt T. Clark\ Iddo Friedberg^'^ and Predrag Radivojac^ '"^ 

^Department of Computer Science and Informatics, Indiana University, Bloomington, IN, USA, ^Department of 
Microbiology and ^Department of Computer Science and Software Engineering, Miami University, Oxford, OH, USA 



ABSTRACT 

Motivation: The automated functional annotation of biological macro- 
molecules is a problem of computational assignment of biological 
concepts or ontological terms to genes and gene products. A 
number of methods have been developed to computationally annotate 
genes using standardized nomenclature such as Gene Ontology (GO). 
However, questions remain about the possibility for development of 
accurate methods that can integrate disparate molecular data as well 
as about an unbiased evaluation of these methods. One important 
concern is that experimental annotations of proteins are incomplete. 
This raises questions as to whether and to what degree currently 
available data can be reliably used to train computational models 
and estimate their performance accuracy. 

Results: We study the effect of incomplete experimental annotations 
on the reliability of performance evaluation in protein function predic- 
tion. Using the structured-output learning framework, we provide the- 
oretical analyses and carry out simulations to characterize the effect of 
growing experimental annotations on the correctness and stability of 
performance estimates corresponding to different types of methods. 
We then analyze real biological data by simulating the prediction, 
evaluation and subsequent re-evaluation (after additional experimental 
annotations become available) of GO term predictions. Our results 
agree with previous observations that incomplete and accumulating 
experimental annotations have the potential to significantly impact ac- 
curacy assessments. We find that their influence reflects a complex 
interplay between the prediction algorithm, performance metric and 
underlying ontology. However, using the available experimental data 
and under realistic assumptions, our results also suggest that current 
large-scale evaluations are meaningful and almost surprisingly reliable. 
Contact: predrag@indiana.edu 

Supplementary information: Supplementary data are available at 
Bioinformatics online. 



1 INTRODUCTION 

Assigning function to gene products is of primary importance in 
biology, yet with the overwhelming abundance of sequence data 
in the post-genomic era, only a small fraction of gene products 
have been annotated experimentally. Therefore, it has become 
important in computational biology to predict function based on 
sequence, structure and other data when experimental annota- 
tions are unavailable (Friedberg, 2006; Punta and Ofran, 2008; 
Rentzsch and Orengo, 2009). With the development of a large 
number of function prediction methods, there is a need for 
unbiased assessment of these methods. Community-based 
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challenges, such as MouseFunc (Pena-Castillo et aL, 2008) and 
the Critical Assessment of Functional Annotation (CAFA) 
(Radivojac et aL, 2013), have emerged to address this problem 
through the prediction of ontological annotations. The objective 
in the CAFA challenge, for example, is to predict a set of Gene 
Ontology (GO) terms associated with a given protein, where GO 
is a hierarchical knowledge representation of functional descrip- 
tors (terms) organized in a large directed acyclic graph 
(Ashburner et aL, 2000). 

To evaluate the performance of function prediction methods 
properly, a set of metrics needs to be estabhshed. At the same 
time, an important challenge we face when assessing the perform- 
ance of these methods is that because of the incremental nature 
of scientific discovery, our knowledge of any given protein's 
function is likely to be partial. Therefore, a function prediction 
that is originally assessed as a false positive may be discovered to 
be a true positive at a later stage, and similarly, a prediction that 
is initially assessed as a true negative may later be discovered to 
be a false negative. This problem of incomplete data has led to 
doubts regarding the reliabihty of evaluations of protein function 
prediction algorithms (Dessimoz et aL, 2013; Huttenhower et aL, 
2009). 

The problem of incomplete data in the training and assessment 
of classifiers has been recognized both in computational biology 
(Huttenhower et aL, 2009) and machine learning (Elkan and 
Noto, 2008; Rider et aL, 2013). Huttenhower et aL have con- 
cluded that the effect of missing annotations can produce mis- 
leading evaluation results because the classifiers are differentially 
impacted. This results in the re-ranking of classifiers upon re- 
evaluation at a time when more experimental annotations are 
available. Similarly, Rider et aL (2013) studied the problem in 
the framework of asymmetric class-label noise, in which negative 
examples contain some mislabeled data points. However, both 
studies only considered a binary classification scenario, e.g. when 
a predictor is developed for a particular term in the ontology. 

Here we study the effect of incomplete experimental annota- 
tions on the quality of performance assessment of protein func- 
tion prediction methods. To that end, we consider protein 
function prediction as a structured-output learning problem in 
which a classifier is expected to output a totahty of (interdepend- 
ent) GO terms for a given sequence. We consider both topo- 
logical and information-theoretic metrics and analytically 
derive under what conditions the initial performance evaluation 
will underestimate or overestimate the true accuracy. Then, we 
provide simulations to characterize the impact for different types 
of predictors. Finally, we analyze experimental protein function 
data by simulating the CAFA experiment. Our results regarding 
the potential impact of incomplete data on correctness of evalu- 
ation largely agree with previous studies. However, under 
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realistic assumptions, we provide evidence that the impact of 
missing data on rehable evaluation is surprisingly small. As a 
result, this study provides confidence that large-scale perform- 
ance evaluation of protein function predictors is useful. In add- 
ition, our study raises concerns about potentially different 
conclusions that can be reached when protein function is studied 
as a series of binary classifiers versus using a structured-output 
learning formulation. 



2 METHODS 

2.1 Protein function prediction formulation 

We consider protein function prediction within a structured-output learn- 
ing framework. Given a training set of labeled input objects {(x/,^/)}.^^ 
where each x e X and y ey, the objective is to infer a relationship 
f:?i^y that minimizes some loss function. The input space X consists 
of proteins, whereas the output space 3^ is a set of all consistent subgraphs 
of some underlying directed acyclic graph such as GO. Here, by saying 
'consistent', we mean that if any term v from GO is used to annotate a 
protein x G A' (either experimentally or computationally), all the ances- 
tors of V up to the root(s) of the ontology must also be included as an 
annotation of x. 



2.2 Evaluation of prediction accuracy 

To understand how the incomplete functional annotations influence ac- 
curacy estimation, we first review two representative evaluation schemes 
in this domain. For simplicity, we shall consider that the evaluation set, 
either through a hold-out set or cross-validation, consists of a single 
protein with its experimental annotation T and a prediction P created 
by some classifier. This single protein can be considered to be an average 
case when a larger evaluation set is available. We use V to refer to the 
entire set of terms in the ontology; thus, with a minor abuse of notation, 
we write that 7" c F and P c K. Typically, F is a large graph with tens of 
thousands of nodes, Tis a small graph with about 10-100 nodes, whereas 
P depends on a particular classifier under evaluation. 

F-measure. Given a protein with its non-empty consistent annotations 
T and P, the precision (pr) and recall (rc) are calculated as 



pr{P,T)- 



\pr\T\ 



and rc{P, T) = 



\pr\T\ 



where \P\ is the number of predicted terms, \T\ is the number of experi- 
mental terms, and |P n 7"! is the number of correctly predicted terms by 
the model. We will also refer to these quantities using the following no- 
tation: tp=\PC\T\ as true-positive findings, ^= \P — T\ as false-positive 
findings, tn=\P f^T\ as true-negative findings and fn=\T — P\ as false- 
negative findings. Here, P=V — P is the complement of P, and T is 
the complement of T. The F-measure of the predicted annotation is 
defined as 



Ff,(P,T) = {l- 



priP, T) ■ rciP, T) 
0^pr{P, T) + rciP,T)' 



where ^ is sl positive number. Here we only consider /3= 1, which results 
in the F-measure that represents a harmonic mean between precision and 
recall. 

Semantic distance. Semantic distance is based on an assumption that 
the prior probability of a protein's experimental annotation can be mod- 
eled by a Bayesian network in which the conditional probability tables are 
calculated from data (Clark and Radivojac, 2013). Here, we calculate 
misinformation (mi) and remaining uncertainty (ru) as 

mi(P, T)= ia(v) and ru(P, T)= ^ ia(v), 



where v e Fis a vertex in the graph, V(v) is a set of its parents, Fr(v\V(v)) 
is the probability that a protein is experimentally annotated by v given 
that all its parents are a part of the annotation, and ia(v) = - log 
(Pr(v|P(v)) is information accretion. 

Semantic distance between two consistent graphs P and T is defined as 

Sk{P, T) = {ru^{P, T)^mi\P, T)f 

for any k>\. Here we only consider k = 2; thus, is the Euclidean 
distance between the point (ru(P, T), mi(P, T)) and the origin of the co- 
ordinate system. 

It is important to mention that both evaluation schemes are applied at 
the protein level. That is, they provide prediction accuracy on each test 
protein and are typically averaged over a set of proteins. A different 
group of metrics, those that evaluate a predictor's performance for a 
particular term v in the ontology (e.g. 'catalytic activity'), may also be 
used. In this case, an area under the receiver operating characteristic 
curve and the term-based F-measure are routinely used (Sharan et al, 
2007). However, these metrics are reliable only if a test set contains 
a sufficiently large number of proteins experimentally annotated by 
term v. In addition, it is not clear how to average term-based evaluations 
over all (interdependent) terms in the ontology to rank two classifiers on 
a given set of proteins. Thus, term-based metrics are not considered in 
this work. 

2.3 The problem of incomplete annotations 

The process of functional annotation of proteins starts with experimental 
research in which a protein is revealed to be involved in certain biochem- 
ical and cellular activities. After the publication, this information is fur- 
ther processed by biocurators who then assign functional annotation to 
the protein using standardized vocabularies such as GO. Each protein- 
term association is further assigned a set of evidence codes that document 
the nature of the experiment used to interrogate a protein's activity. 

Although this process has several layers of control, there are a number 
of challenges that influence the quality of annotation. For example, pro- 
teins may be multifunctional and display different types of activities 
under different biological and experimental contexts. Similarly, owing 
to the technical limitations the assays used to determine their function 
may not provide sufficiently detailed evidence. For those reasons, while 
the experimental annotations assigned to a protein are generally reliable, 
they are unlikely to be complete. Two important questions, both theor- 
etically and empirically, emerge: (i) How useful are these available anno- 
tations for training of computational models? (ii) To what extent can 
these partial annotations be reliably used to estimate the accuracy of 
computational models before the complete experimental annotation is 
available? We focus on the problem of evaluation. 

To formalize our approach, we consider two evaluation scenarios: the 
original evaluation on incomplete data and another evaluation at a later 
point in time when additional experimental annotations become avail- 
able. For simplicity, we shall refer to the latter scenario as evaluation on 
complete data. We will use T to denote a protein's incomplete experimen- 
tal annotation, T' to denote the complete annotation, where T ^ T', and 
refer to V - T as new annotation. Predicted annotations P are generated 
only once, at the time of original evaluation. Applying the same naming 
convention, we use pr, rc, etc., to refer to precision, recall, and other 
incomplete-data metrics, respectively, whereas we use p/, rc' , etc., for 
their equivalents on complete data. Our goal is to study and understand 
the relationship between corresponding metrics, such as pr and pr' , rc and 
rc' , Fi and F{, etc. 

2.4 The impact of new annotations on Fi 

We analyze the impact of incomplete annotations on accuracy estima- 
tion using two confusion matrices, one provided at the time of original 
evaluation and the other at a later point in time when additional 
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Fig. 1. Confusion matrix on (A) incomplete data and (B) complete data 



experimental annotations become available (Fig. 1). We assume that 
there slvq S=\T^ — T\ new annotations; of those, a terms were predicted 
as negatives by the predictor, whereas ^ terms were predicted as positives. 
The incomplete-data precision (pr) and recall (rc) are defined as 



pr = 



tp 



and rc = 



tp 



tp+fp tp+fn' 
whereas the F-measure can be written as 

2tp 



Fi = 



2tp+fp+fn' 



On the other hand, as illustrated in Figure 1, the complete-data precision 
(p/) and recall (rc') can be expressed as 

pr = - — and rc = - 



tp+fp tp+fn + a^^' 

The complete-data F-measure {F\) then becomes 



2-{tp + ^) 



2tp+fn+fp + a + ^ 



We first observe that pr' > pr because ^ >0. The change of precision 
\r=pr' — pr = ^/(tp+fp). On the other hand, the relationship between 
rc' and rc is less obvious. We express the change in recall A^^ = rc' — rc as 



tp + ^ 



tp 



tp^fn^a^^ tp+fn 
^ ^-fn-a-tp 
(tp+fn)(tp+fn + a + 

By recognizing that rcs =/3/(q; + /3) is the recall on the new annotations, 
it follows that the originally estimated recall increases if rcs > rc. 
Similarly, for the change in F-measure, we derive that 



Af,=F',-F,=2 



l3-c-(a^l3)-tp 
c-(c + a + ^) ' 



where c = 2tp -^fp+fn. The above equation leads to the following unex- 
pectedly simple result 

, f>0 iirc,>]-F, 
I <0 otherwise 

We see that to maintain the performance upon receiving new annotations, 
the predictor must have a recall on T' — T greater than one half of the 
original F^. 



2.5 The impact of new annotations on S2 

We illustrate the analysis of information-theoretic metrics in Figure 2. 
Analogous to the analysis of Fi, we introduce three non-negative 
quantities to describe the impact of new annotations on the semantic 
distance as 



E 



iaiv), p= J2 '"^"^ 

veT'-T-P ye{T'-T)nP 



root of ontology 




ru{P, T) 



Fig. 2. Illustration of the changes of remaining uncertainty and misinfor- 
mation after the set of experimental annotations changes from T to T , 
where 7" c 7"^ P is a set of predicted terms determined by leaf annotation 
nodes p\ and p2', 2" is a set of incomplete-data experimental terms deter- 
mined by ti and T' is a set of complete-data experimental terms deter- 
mined by leaf annotation nodes /i, ^2, h and t^. The changes are illustrated 
(A) using a Venn diagram and (B) using a directed acyclic graph 



and 8 = a + ^. Thus, the complete data remaining uncertainty and misin- 
formation become 

ru' ^ru + a and mi' = mi — ^. 

Note that (ru, mi) e M^; thus, 5*2 can be recognized as the L2-norm of this 
vector. The absolute value of the change of semantic distance A^^ can be 
bounded as follows 

|A5j = 1^2-^21 

= \\\(ru + a, mi-m2-\\(ru, mOlbl 

< /3)||2 (by Minkowski inequality) 

However, without further assumptions, the difference in semantic dis- 
tance can either be positive or negative depending on the performance 
of a predictor on the new experimental annotations. 

2.6 Simulations 

Here we show the simulation results for both topological and informa- 
tion-theoretic metrics under reasonable assumptions. For the simulation 
related to F-measure, we assume that the recall on new data equals the 
recall on complete data as well as statistical independence between 
(tp+fri) and 8, which together define the level of incompleteness. 
Values of these two variables were sampled from a distribution estimated 
using new experimental annotations from Swiss-Prot provided between 
January 2011 and January 2014; see next section. The parameter /3 was 
generated from a binomial distribution Bm(8, rc). 

The simulation for semantic distance was conducted in an analogous 
way. Here we sampled y^J2veT^^(^) ^ independently and generated 
the ratio ^/8 using a Beta distribution B(y — ru, ru). While the beta dis- 
tribution was chosen out of convenience, it allowed us to control that 
E[/3/5] = (y — ru)/ru, i.e. that the fraction of information content of 7" that 
was correctly predicted was unchanged on the new annotations. 

Note that in both cases, trials were discarded if the generated values 
were invalid, i.e. if ^>fp (Fig. 3) or y<ru (Fig. 4). Figures 3 and 4 show 
the impact averaged over 10000 trials for each GO classification. 
Simulation results under several other conditions are provided in 
Supplementary Materials. 



3 EXPERIMENTS AND RESULTS 

To analyze the impact of new annotations on accuracy esti- 
mates of protein function predictors, we designed a hypothetical 



1611 



Y Jiang et al. 



Molecular function 


Biological process 


Cellular component 




m 














U 








0 .2 .4 .6 .8 1 
Recall 




1 0 .2 .4 .6 .8 1 
Recall 



0.2 
0.1 
0 

-0.1 
-0.2 



Fig. 3. Simulated changes of Fy. (A) Absolute changes and (B) relative 
changes, as a function of precision and recall estimated on incomplete 
data 
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Fig. 4. Simulated changes of ^^2. (A) Absolute changes and (B) relative 
changes, as a function of misinformation and remaining uncertainty esti- 
mated on incomplete data 



prediction challenge similar to CAFA (Radivojac et al., 2013), 
with three methods as virtual participants: GOtcha, BLAST and 
Swiss-Prot computational annotation (evidence codes: ISS, ISO, 
ISA, ISM, IGC, IBA, IBD, IKR, IRD, RCA and lEA). Swiss- 
Prot experimental annotations (EXP, IDA, IMP, IPI, IGI, lEP, 
TAS and IC) were used as gold standard. 

The evaluation was performed as follows: (i) All methods were 
trained on 42 570 experimentally annotated proteins from the 
January 2010 version of Swiss-Prot and then initially evaluated 
on proteins that were un annotated in 2010 but acquired anno- 
tations between January 2010 and January 2011, (ii) January 
versions of Swiss-Prot from 2012, 2013 and 2014 were then 
used to collect a set of proteins that were experimentally anno- 
tated between 2010 and 2011, but then acquired additional an- 
notations after January 2011. Such experiment allowed us to 
understand the extent to which new experimental annotations, 
3 years after original data collection, affected the initial esti- 
mates of performance accuracy. The experiment is illustrated in 
Figure 5. 

We refer to the performance evaluation on the 2011 data as 
initial evaluation. Similarly, the evaluation on the data from sub- 
sequent years is referred to as re-evaluation. Each re-evaluation 
was performed as if the data at that point was complete. Finally, 
we note that all proteins for which the Swiss-Prot curators 
removed any GO terms at any point between 2011 and 2014 
were ignored in this experiment. This excluded 2323 (53%; of 
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Fig. 5. The timeline of data collection and evaluation. The light green 
area represents the experimental annotations of the 2011 test set, while 
the dark green area represents new annotations 



those, 1899 were related to the removal of term 'protein binding') 
proteins in the Molecular Function category, 1088 (22%) in the 
Biological Process category and 672 (15%) proteins in the 
Cellular Component category. 



3.1 Participating methods 

BLAST This method assigns a score to a sequence-term asso- 
ciation according to the highest sequence similarity between the 
target sequence and any of the training sequences that are ex- 
perimentally associated with that term. More formally, the score 
for the target sequence q and term v was computed as 
Bv(q) = msLXs^Sy{—^oge(q,s)], where e(-,-) denotes the E- value 
between two sequences as returned by the BLAST ahgnment 
(Altschul et al., 1997), and Sy is a set of training sequences ex- 
perimentally annotated with term v. The cutoff for E-value was 
set to 1.0 such that all scores were non-negative. 

GOtcha Instead of assigning the maximum similarity among all 
sequences associated with a particular term v, GOtcha aggregates 
the evidence collected from multiple similar sequences (Martin 
et al., 2004). A normalization step is subsequently taken to 
enforce that the final score is within the range [0, 1]. That is, 
G,(q) = r,(q)/r,oot(q), where r,(q)= - E^e^, log K^, ^) is called 
the r-score for the sequence q and term v. 

Swiss-Prot As the third method for comparisons, we simply 
retrieved all non-experimental annotations from the January 
2010 version of Swiss-Prot. 



3.2 Evaluation metrics 

The first set of evaluation metrics was identical to those used in 
the CAFA challenge (Radivojac et al., 2013). Because BLAST 
and GOtcha output soft scores, their performance is visualized 
using precision-recall curves. As a single number to rank the 
methods, we calculate i^max , which is the maximum F-measure 
over the entire range of decision thresholds (Radivojac et al., 
2013). Similarly, we calculate misinformation and remaining un- 
certainty over the range of possible decision thresholds and use 
the minimum semantic distance to rank methods (Clark and 
Radivojac, 2013). 

We note that Swiss-Prot experimental annotations do not have 
confidence scores; thus, the Swiss-Prot computational annota- 
tions can be presented as a single point in the pr-rc or the 
ru-mi plane. 
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3.3 Results 

Figures 6 and 7 show the performance of each method in the pr- 
rc and ru-mi planes, separately for each of the three classifica- 
tions in GO. Here, the incomplete-data evaluations (2011) for 
each method are shown as solid gray curves. On the other hand, 
color-coded curves are used to visualize the performance on in- 
complete data (solid curves, full circles) and complete data 
(dotted curves, empty circles) on the subsets of proteins that 
acquired new annotations in the periods from 2011 to 2012 
(green), 2011 to 2013 (blue) and 2011 to 2014 (red). Note that 
the complete-data evaluations on the entire set of proteins are 
not shown as the curves would overlap solid gray curves. 
Tables 1-4 summarize the performance evaluation using i^^ax 
and Srmn after a 3 year period during which new annotations 
were allowed to accumulate in the Swiss-Prot database. 

The results shown in Figures 6 and 7 and Tables 1-4 provide 
evidence that the effects of incomplete annotations are ontology- 
specific, metric- specific and algorithm-specific. The impact on 
Fmax was relatively small; within 2 percentage points on the 



entire dataset (Table 1) and within 5 percentage points when 
the evaluation was restricted to the proteins that accumulated 
new annotations (Table 2). The impact on Smm , on the other 
hand, was larger and suggests that the initial evaluations are 
consistently overestimating the quality of performance. The over- 
all change of S^mn in the 3 year period was within 1 bit of 
information for Molecular Function and Cellular Component 
classifications, and within 3 bits for Biological Process 
(Table 3). When the evaluation was restricted to the subset of 
proteins that accumulated new annotations, this difference 
became more significant: within 6 bits for Molecular Function 
and Cellular Component and within 12 bits for Biological 
Process (Table 4). 



4 CONCLUSIONS 

Incomplete experimental annotation of protein function affects 
both the development of computational function prediction 
methods and their unbiased evaluation. In this work, we 
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Fig. 6. Precision-recall evaluation on each GO classification using (A) GOtcha, (B) BLAST and (C) Swiss-Prot predictions from January 2010. Solid 
gray curves (filled circles) represent performance of each method using proteins that did not have experimental annotation in 2010 but were experi- 
mentally annotated in January 2011. Color curves show evaluation on those proteins from the 2011 set that accumulated new annotations in 2012 
(green), 2013 (blue) and 2014 (red). Solid curves (filled circles) show incomplete-data evaluation on each subset of proteins, whereas dotted curves (empty 
circles) show complete-data evaluation 
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Fig. 7. Information- theoretic evaluation on each GO classification using (A) GOtcha, (B) BLAST, and (C) Swiss-Prot predictions from January 2010. 
Solid gray curves (filled circles) represent performance of each method using proteins that did not have experimental annotation in 2010 but were 
experimentally annotated in January 2011. Color curves show evaluation on those proteins from the 2011 set that accumulated new annotations in 2012 
(green), 2013 (blue) and 2014 (red). Solid curves (filled circles) show incomplete-data evaluation on each subset of proteins, whereas dotted curves (empty 
circles) show complete-data evaluation 



Table 1. Impact on i^n^ax between 2011 and 2014 averaged over all test proteins from the 2011 evaluation 



Molecular function Biological process Cellular component 



^ ^ ^ max ^ max ^^max ^ max ^ max ^^max ^ max ^ max ^-f'n 



GOtcha 0.582 0.583 +0.001 0.388 0.399 +0.011 0.612 0.606 -0.006 

BLAST 2065 25711 0.455 0.464 +0.009 3971 27771 0.297 0.316 +0.019 3750 27249 0.454 0.465 +0.010 
Swiss-Prot 0.500 0.504 +0.004 0.315 0.321 +0.006 0.478 0.468 -0.010 



Note: The change of F^ax is influenced by a subset of proteins that accumulated new annotations in the 3 year period. N is the number of test proteins, whereas N"' is the 
number of training proteins. 



addressed the evaluation problem, both theoretically and empir- 
ically, by considering function prediction within a structured- 
output learning framework. We find that the nature and level 
of incompleteness in the data, types of classification models and 
performance evaluation metrics are all contributing factors to the 
complexity of understanding evaluation bias. 



Both simulations and empirical evaluation suggest that the 
influence of incomplete annotations on topological metrics, 3 
years after the initial assessment, may result in either overesti- 
mated or underestimated performance, but the effect is generally 
small. The reason for this can probably be found in the cancel- 
lation of effects of increasing precision and potentially decreasing 
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Table 2. Impact on F^ax between 2011 and 2014 averaged over those test proteins from the 2011 evaluation that acquired new annotations until re- 
evaluation 



Molecular function Biological process Cellular component 



N Mp fZI (Coverage) A^^„^^ N fZl (Coverage) A^^^^^ N Mp F^l (Coverage) 



GOtcha 0.569 (92.8%) -0.004 0.381 (90.8%) +0.038 0.599 (93.2%) -0.016 

BLAST 279 0.60 0.435 (92.8%) +0.049 1130 0.77 0.300 (90.8%) +0.054 865 0.33 0.440 (93.2%) +0.047 
Swiss-Prot 0.531 (76.3%) +0.030 0.308 (58.9%) +0.009 0.441 (55.1%) -0.045 



Note: N is the number of test proteins; Mp is the median of p= \ T' — T\/\T\ over all test proteins. 



Table 3. Impact on 5*111111 between 2011 and 2014 averaged over all test proteins from the 2011 evaluation 



Molecular function Biological process Cellular component 



N As^^^ N N- A,^,„ N A,„ 



GOtcha 6.32 6.89 +0.57 14.10 17.10 +2.99 4.70 5.67 +0.96 

BLAST 2065 25 711 7.87 8.57 +0.70 3971 27 771 16.60 19.50 +2.89 3750 27249 5.64 6.49 +0.85 
Swiss-Prot 6.48 6.88 +0.40 13.90 17.10 +3.19 4.68 5.71 + 1.03 



Note: The change of S^in is influenced by a subset of proteins that accumulated new annotations in the 3 year period. N is the number of test proteins, whereas N^'' is the 
number of training proteins. 

Table 4. Impact on 5*111111 between 2011 and 2014 averaged over proteins from the 2011 evaluation that acquired new annotations until re-evaluation 



Molecular function Biological process Cellular component 



N Mp SZl (Coverage) A^^ ^^on (Coverage) A5.,,, Mp Sf^^ (Coverage) A5., 



GOtcha 6.69 (92.8%) +4.35 16.20 (90.8%) + 10.40 5.24 (93.2%) +3.93 

BLAST 279 0.72 8.26 (92.8%) +5.52 1130 0.67 20.80 (90.8%) +9.73 865 0.85 6.80 (93.2%) +3.35 

Swiss-Prot 7.24 (76.3%) +2.95 16.30 (58.9%) + 11.40 5.39 (55.1%) +4.45 



Note: N is the number of proteins; Mp is median of p = 8/y over all test proteins. Coverage represents the percentage of proteins on which a method outputs a prediction. 



recall — resulting, surprisingly, in relatively stable F-measure 
estimates. On the other hand, information- theoretic metrics 
appear to be more sensitive to data incompleteness. Here we 
fmd that the semantic distance generally increases on real data, 
which is likely a consequence of the fact that the newly added 
experimental annotations are typically deep in the ontology and 
thus contribute significantly toward the overall performance 
measurements. 

The classification algorithms are differentially impacted by 
data incompleteness as well. In particular, we observe that 
those tools that operate in the low-precision high-recall region 
of the pr-rc plane are the most significantly impacted by missing 
annotations. Finally, we find that the level of incompleteness in 
Swiss-Prot is not large enough to seriously impact accuracy as- 
sessments. The data available to us, however, were only analyzed 
3 years after the initial evaluation, therefore not significantly 
eliminating the possibility for larger changes in light of novel 
biological discoveries. 

In summary, evaluation of function prediction methods at any 
given time is not error-free. When there is an interest in specific 



functions for specific gene products, predictions that are dis- 
covered at a later time to be true-positive predictions or false- 
negative predictions because of missing data may be problematic. 
However, when comparing methods on large datasets, assessing 
the performance of protein function prediction methods provides 
meaningful information about their accuracy and usefulness, 
with a quantifiably low error rate. 
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